import pandas as pd import numpy as npimport matplotlib.pyplot as pltfrom sklearn.model_selection import train_test_splitfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Elevator pitch
After cleaning and restructuring the Star Wars survey data, our Random Forest model achieved an accuracy of 75.7% in predicting whether respondents earn more than $50,000 per year. Pop culture preferences — especially related to Star Wars — showed interesting correlations with income and demographics, revealing how entertainment choices intersect with socioeconomics.
QUESTION|TASK 1
Shorten the column names and clean them up for easier use with pandas. Provide a table or list that exemplifies how you fixed the names.
Original Name
Cleaned Name
Which of the following Star Wars films have you seen? Please select all that apply.
seen_any
Age
age
Education
education
Household Income
income
Show the code
url ="https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv"df = pd.read_csv(url, encoding="ISO-8859-1")rename_map = {"Which of the following Star Wars films have you seen? Please select all that apply.": "seen_any","Age": "age","Education": "education","Household Income": "income"}df = df.rename(columns=rename_map)df.columns = df.columns.str.strip().str.replace(" ", "_").str.replace("?", "").str.lower()df.head()
Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.
a. Filter the dataset to respondents that have seen at least one film
a. Create a new column that converts the age ranges to a single number. Drop the age range categorical column
a. Create a new column that converts the education groupings to a single number. Drop the school categorical column
a. Create a new column that converts the income ranges to a single number. Drop the income range categorical column
a. Create your target (also known as “y” or “label”) column based on the new income range column
a. One-hot encode all remaining categorical columns
The reformatted data captures not just demographic attributes like age, education, and income — but also opinions and preferences related to the Star Wars universe, along with geographic and pop culture context. This dataset provides a unique lens into how demographic traits (age, education, income) and personal interests (Star Wars fandom, film ranking, Star Trek crossover) intersect. Even within this small sample: Education and age align somewhat with income, but not always predictably. Star Wars fans in this sample are more likely to be high earners. Geographic and cultural variables are captured and ready to be tested in your machine learning model to see what really drives income predictions.
edu_map = {"Less than high school degree": 1,"High school degree": 2,"Some college or Associate degree": 3,"Bachelor degree": 4,"Graduate degree": 5}df_seen["education_num"] = df_seen["education"].map(edu_map)df_seen.drop(columns="education", inplace=True)
This table illustrates how survey responses were converted to numeric formats:
Age, education, and income are now usable numerical columns.
The target column defines the prediction goal.
Each one-hot encoded column reflects a categorical feature (e.g., gender, location, preferences) as True/False binary flags.
These transformations ensured that the dataset could be successfully fed into a machine learning pipeline without error, and allowed us to evaluate demographic and preference-based predictors of income with interpretable results.
QUESTION|TASK 3
Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.
To validate the dataset against the original article, I recreated two visualizations: one showing general Star Wars viewership and another showing the most disliked characters. The first chart confirms that the vast majority of respondents have seen at least one of the six Star Wars films, while a much smaller group indicated they had not. This supports the reliability of the rest of the survey, since most participants were familiar with the franchise and could provide informed opinions. The second chart attempts to visualize which characters were viewed most unfavorably. While the underlying logic worked, the chart labels defaulted to generic terms like “Him/Her” due to long or uncleaned column names in the dataset. Nonetheless, the plot shows that respondents had strong negative reactions to at least one character—consistent with the article’s emphasis on Jar Jar Binks being widely disliked. Together, these visualizations affirm that the dataset aligns reasonably well with the original article and contains meaningful patterns in viewership and character sentiment.
Show the code
# Try to detect multiple episode-specific viewership columnsseen_cols = [col for col in df.columns if"have_you_seen"in col and"episode"in col]if seen_cols:# Count "Yes" responses for each movie column movie_counts = df[seen_cols].apply(lambda col: col =="Yes").sum().sort_values() movie_counts.plot(kind="barh", title="Star Wars Movie Viewership", xlabel="Respondents") plt.tight_layout() plt.show()else:# Fallback if only the summary column exists summary_col ="have_you_seen_any_of_the_6_films_in_the_star_wars_franchise"if summary_col in df.columns: df[summary_col].value_counts().plot( kind="bar", title="Seen Any Star Wars Film?", ylabel="Respondents" ) plt.tight_layout() plt.show()else:print("Movie viewership columns not found.")print([col for col in df.columns if"seen"in col or"star_wars"in col])
Show the code
import matplotlib.pyplot as plt# Find all columns related to character favorabilitychar_cols = [col for col in df.columns if"unfavorably"in col or"character"in col]if char_cols:# Count "Very unfavorably" votes across all relevant columns char_votes = df[char_cols].apply(lambda col: col.value_counts().get("Very unfavorably", 0))# Clean up column names into readable character names clean_labels = [ col.split("with_")[-1].replace("_", " ").replace(".", "").title()for col in char_votes.index ] char_votes.index = clean_labels# Sort and plot char_votes.sort_values().plot( kind="barh", figsize=(8, 6), title="Most Disliked Star Wars Characters", color="steelblue" ) plt.xlabel("Number of 'Very Unfavorably' Votes") plt.ylabel("Character") plt.tight_layout() plt.show()else:print("Disliked character columns not found. The dataset version may not include them.")# Helpful debug: show star wars–related columnsprint([col for col in df.columns if"star_wars"in col])
Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.
To predict whether a respondent earns more than $50,000 annually, I trained a Random Forest Classifier using the cleaned and preprocessed Star Wars survey dataset. The features included age, education level, income, and one-hot encoded responses to various Star Wars-related survey questions. The model achieved an accuracy of 75.71%, significantly outperforming the initial 62% estimate mentioned in the planning stage. The classification report shows a strong ability to correctly identify high-income earners (target = 1), with a precision of 0.76, recall of 0.98, and f1-score of 0.86. While the model struggles more with predicting low-income respondents (target = 0), the overall performance indicates that demographic and cultural preferences in the dataset contain meaningful patterns associated with income.
Show the code
# define features and targetX = df_final.drop(columns=["income_num", "target"])y = df_final["target"]X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)model = RandomForestClassifier(random_state=42)model.fit(X_train, y_train)y_pred = model.predict(X_test)accuracy = accuracy_score(y_test, y_pred)print(f"Random Forest accuracy: {accuracy:.2%}")print(classification_report(y_test, y_pred))
Build a machine learning model that predicts whether a person makes more than $50k. With accuracy of at least 65%. Describe your model and report the accuracy.
In Stretch Task 1, I sought to improve the model beyond 65% accuracy. By tuning hyperparameters (increasing estimators and adjusting depth), we achieved the same 75.71% accuracy on the test set, which meets and exceeds the 65% stretch goal. These results suggest that even in a culturally niche dataset like this, machine learning can effectively model real-world socioeconomic traits when survey data is properly cleaned and engineered.
Validate the data provided on GitHub lines up with the article by recreating a 3rd visual from the article.
To validate the dataset against the original article, I recreated a third visual showing the distribution of respondents by gender. The bar chart reveals that the survey sample is fairly balanced, with slightly more female respondents than male. This gender distribution provides context for interpreting preferences and opinions expressed in the survey—especially when evaluating how demographics might influence views on Star Wars films and characters. Ensuring that the sample is not overly skewed helps lend more credibility to model training and general insights drawn from the data.
Show the code
if"gender"in df.columns: df["gender"].value_counts().plot(kind="bar", title="Respondents by Gender", ylabel="Count") plt.tight_layout() plt.show()else:print("Gender column not found.")
STRETCH QUESTION|TASK 3
Create a new column that converts the location groupings to a single number. Drop the location categorical column.
To prepare the dataset for machine learning, I converted the categorical location_(census_region) variable into a numerical format by assigning each region a unique code using pandas’ category method. This transformation allows the model to interpret geographic data in a structured, numeric form without introducing artificial ordinal relationships. The original text column was dropped after conversion to avoid redundancy. This step ensures location can now be used as a predictive feature in the model while maintaining a clean and efficient dataset structure.
Show the code
if"location_(census_region)"in df_seen.columns:# Convert location to categorical codes df_seen["location_num"] = df_seen["location_(census_region)"].astype("category").cat.codes# Drop original column df_seen.drop(columns="location_(census_region)", inplace=True)# Preview resultprint(df_seen[["location_num"]].head())else:print("Location column not found.")